IBM HR Analytics Employee Attrition and Performance Dataset

In this study, we analyze the IBM HR Analytics dataset available from kaggle.com [1]. The dataset has already been analyzed and prepared for modeling (see the details here).

Standardized Dataset

Problem Description

In the dataset, Attrition indicates whether an employee has left the company (Yes) or not (No). We would like to build a model that predicts this target from the remaining features.

X and y sets
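
A minimal sketch of separating the target from the features, assuming the prepared data is held in a pandas DataFrame with an `Attrition` column coded as Yes/No. The toy rows and the other column names below are hypothetical and only stand in for the real dataset.

```python
import pandas as pd

# Hypothetical toy rows standing in for the prepared dataset.
data = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "MonthlyIncome": [5993, 5130, 2090, 2909],
    "Attrition": ["Yes", "No", "No", "Yes"],
})

# Target: 1 for employees who left (Yes), 0 otherwise (assumed encoding).
y = (data["Attrition"] == "Yes").astype(int)

# Features: every column except the target.
X = data.drop(columns=["Attrition"])

print(X.shape, y.tolist())
```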

Correlation of the features

Training and testing sets

StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.
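
The stratification property can be checked with a short sketch using scikit-learn's `StratifiedKFold`. The labels below are hypothetical (88 negatives and 12 positives), and the number of splits and the random seed are arbitrary choices for illustration, not the settings used in this study.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 88 negatives (0) and 12 positives (1).
y = np.array([0] * 88 + [1] * 12)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

test_positive_rates = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    # Share of positive samples in the held-out part of this fold.
    rate = y[test_idx].mean()
    test_positive_rates.append(rate)
    print(f"fold {fold}: test size = {len(test_idx)}, positive rate = {rate:.2f}")
```

Each held-out fold keeps a positive rate close to that of the full label vector, which is what makes the per-fold scores comparable on imbalanced data.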

Modeling: The Sequential Model

The Sequential model [5] is a linear stack of layers. It is appropriate when each layer has exactly one input tensor and one output tensor.
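
As an illustration, a binary classifier can be written as a Sequential stack as follows. This is a minimal sketch assuming the Keras API shipped with TensorFlow; the number of input features, the layer sizes, and the optimizer are hypothetical choices, not the architecture used in this study.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features = 30  # hypothetical number of input columns

# Each layer takes exactly one tensor and returns exactly one tensor.
model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # probability of Attrition = Yes
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Forward pass on placeholder inputs to confirm the output shape.
probs = model.predict(np.zeros((4, n_features), dtype="float32"), verbose=0)
print(probs.shape)
```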

Developing a model

Training the model

The model performance

Some of the metrics that we use here to measure the model performance:
\begin{align}
\text{Confusion Matrix} = \begin{bmatrix}T_p & F_p\\ F_n & T_n\end{bmatrix},
\end{align}

where $T_p$, $T_n$, $F_p$, and $F_n$ represent true positive, true negative, false positive, and false negative, respectively.

\begin{align}
\text{Precision} &= \frac{T_{p}}{T_{p} + F_{p}},\\
\text{Recall} &= \frac{T_{p}}{T_{p} + F_{n}},\\
\text{F1} &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},\\
\text{Balanced-Accuracy (bACC)} &= \frac{1}{2}\left( \frac{T_{p}}{T_{p} + F_{n}} + \frac{T_{n}}{T_{n} + F_{p}}\right).
\end{align}

Accuracy can be a misleading metric for imbalanced datasets. Here, over 88 percent of the samples are negative (No) and about 12 percent are positive (Yes), so a model that always predicts No would already score a high accuracy. In such cases, balanced accuracy (bACC) [6] is recommended: it normalizes the true positive and true negative predictions by the number of positive and negative samples, respectively, and averages the two rates.
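
The formulas above can be evaluated directly from the four confusion-matrix counts. The sketch below uses a hypothetical helper, `classification_metrics`, and made-up counts for a classifier that always predicts No on 1000 samples with 120 positives; these are not results from this study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, accuracy, and bACC from counts."""
    def safe_div(num, den):
        # Return 0.0 when the ratio is undefined (zero denominator).
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)        # true positive rate
    specificity = safe_div(tn, tn + fp)   # true negative rate
    f1 = safe_div(2 * precision * recall, precision + recall)
    accuracy = safe_div(tp + tn, tp + fp + fn + tn)
    balanced_accuracy = 0.5 * (recall + specificity)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
    }

# Hypothetical counts: every sample is predicted as negative (No).
always_no = classification_metrics(tp=0, fp=0, fn=120, tn=880)
print(always_no["accuracy"])           # 0.88
print(always_no["balanced_accuracy"])  # 0.5
```

The accuracy looks high only because negatives dominate the sample, while the balanced accuracy of 0.5 shows that such a classifier is no better than chance at separating the two classes.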


References

  1. Kaggle Dataset: IBM HR Analytics Employee Attrition & Performance
  2. scikit-learn: classifiers
  3. scikit-learn: Metrics and scoring: quantifying the quality of predictions
  4. Confusion matrix
  5. The Sequential model
  6. Mower, Jeffrey P. "PREP-Mt: predictive RNA editor for plant mitochondrial genes." BMC bioinformatics 6.1 (2005): 1-15.